PWN Your Infrastructure:
Behind Call of Duty: World at War

Jason LaPorte (jason@agoragames.com)
Agora Games (http://www.agoragames.com/)

What does Agora Games do?

We make video games awesome.
(And build community-driven websites.)

Who am I?

Infrastructure

  1. Web Server (NGINX)
  2. Load Balancer (HAProxy)
  3. Application Stack (Thin, Rails)
  4. Database (MySQL)

  1. Operating System (Ubuntu Linux)
  2. Network (Firewalls, NFS)
  3. Tools

I'm going to focus on the system side.

(For the application-level stuff, check out
our Guitar Hero talk later today!)

What's wrong with a
typical Rails deployment?

Scalability!

(Of the administrator's time.)

Scalability!

Ideally:

Scalability!

In the real world:

Scalability!

(If you're me...)

One server is trivial.

Three servers is easy.

Twelve servers is tricky.

Fifty servers is a time sink.

What starts to fail?

  1. System updates/fixes take forever.
  2. Hardware errors become frequent.
  3. Transient network failures occur more often.
  4. Unpredictable failures become common.
  5. Design shortcomings become apparent.
  6. etc.

Capistrano would help, but...

  1. It's synchronous.
  2. Failures aren't localized.
  3. Ruby isn't a clean way to specify
    system tasks anyway.

We also need automation and
centralized monitoring!

There's a lot that
needs fixing here.

  1. Failures must be designed around.
  2. Repetitive tasks must be abstracted away.
  3. Monitoring information needs
    to be made accessible.
  4. Deploys need to be centralized
    and simplified.

Design Goals

  1. KISS
  2. When in Rome...
  3. Teach a man to fish...

Designing Around Failures

  1. Virtualization for hardware problems:
    Terremark (http://www.terremark.com/).
  2. Replication for software problems.

(Virtualization also gives us
a lot of flexibility!)

Abstracting Repetition

/usr/local is propagated via NFS.
(Updating code is quick and painless.)

(Well, almost painless.)

Abstracting Repetition

What about configuring a myriad of servers?

Abstracting Repetition

Well, we did what you usually do when
you have too many units to manage...

...we spawned more overlords.

Overlord

A (very) simple Rails app that does two things:

  1. Centralizes configuration.
  2. Aggregates monitoring information.

(Sorry, it's currently proprietary.)

Overlord

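# Fetch this host's configuration script from Overlord and run it.
OVERLORD=overlord.example.com
HOSTNAME=$(hostname)
CONFIG_URL="http://$OVERLORD/hosts/config/$HOSTNAME"

# -f makes curl fail on HTTP errors, so an error page is never executed.
curl -sf "$CONFIG_URL" -o /tmp/autoconfig.sh &&
  /bin/sh /tmp/autoconfig.sh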

Overlord

Two models. A Host has_many Configurations.

Overlord

Each Configuration represents
a file on the host.
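
As a rough illustration of how a Configuration could end up as a file,
here is a minimal sketch of a fragment of a generated autoconfig.sh.
It is not Overlord's real output: the write_config helper, the
CONFIG_ROOT variable, the paths, and the Monit stanza are all
hypothetical, and the sketch only assumes that each Configuration
carries a file path and a body.

```shell
#!/bin/sh
# Hypothetical fragment of a generated autoconfig.sh (not Overlord's real output).
# CONFIG_ROOT is an assumption: it defaults to a scratch directory so the sketch
# can be tried safely. Export CONFIG_ROOT="" to write to the real paths.
CONFIG_ROOT="${CONFIG_ROOT-/tmp/autoconfig-demo}"

# write_config PATH: write stdin to PATH, creating parent directories.
# The file is written under a temporary name and then renamed, so a reader
# never sees a half-written configuration.
write_config() {
  target="$CONFIG_ROOT$1"
  mkdir -p "$(dirname "$target")"
  cat >"$target.tmp" && mv "$target.tmp" "$target"
}

# One call like this per Configuration record belonging to the Host.
write_config /etc/monit/conf.d/thin.monitrc <<'EOF'
check process thin_8000 with pidfile /usr/local/app/shared/pids/thin.8000.pid
  start program = "/etc/init.d/thin start"
  stop program = "/etc/init.d/thin stop"
  if failed host 127.0.0.1 port 8000 protocol http then restart
EOF
```

Generating plain shell keeps the host side down to curl and /bin/sh,
which fits the KISS goal above.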

Overlord

Monit does the rest.

Monit

http://mmonit.com/monit/

Monit

We rely on it for just about everything.

  1. System monitoring.
  2. Starting daemons.
  3. Ensuring liveness.
  4. Email alerts.
  5. etc.

Overlord

We pull XML from Monit, and feed it
into RRDTool for graphing.

http://.../_status?format=xml
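
The Monit-to-RRDTool step might look roughly like the sketch below.
It is not Overlord's actual code: extract_tag and record_load are
hypothetical helpers, the avg01 element name assumes Monit's XML
status layout for system load (which may vary between Monit
versions), and the status URL and RRD file are passed in rather
than guessed.

```shell
#!/bin/sh
# Sketch only: element names assume Monit's XML status layout
# (<system><load><avg01>...) and may differ between Monit versions.

# extract_tag TAG: print the text content of <TAG>...</TAG> from the
# first line of stdin that contains such an element.
extract_tag() {
  sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" | head -n 1
}

# record_load STATUS_URL RRD_FILE: fetch Monit's XML status and store the
# 1-minute load average. Both arguments are hypothetical. The RRD must
# already exist, for example created with:
#   rrdtool create load.rrd --step 60 DS:load1:GAUGE:120:0:U RRA:AVERAGE:0.5:1:1440
record_load() {
  load=$(curl -sf "$1" | extract_tag avg01)
  # Skip the update if the fetch failed or the element was missing.
  [ -n "$load" ] || return 1
  rrdtool update "$2" "N:$load"
}
```

Calling record_load once per step interval (from cron, say) for each
host would be enough to build up a graphable history.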

RRDTool

http://oss.oetiker.ch/rrdtool/

Monit + RRDTool

  1. Nagios (clunky)
  2. ZABBIX (ditto)
  3. MMonit (see above)
  4. Cacti (great, but limited)
  5. Munin (wreaks havoc on system resources)

Centralizing Deploys

Deploying is really a three step process:

  1. Update your code.
  2. Update your environment.
  3. Restart your servers.

Centralizing Deploys

Code has to be propagated to all app servers.

We used to use Capistrano for this.
Now we're using NFS.

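#!/bin/sh
# <set up variables here>
set -e  # abort on the first failed step instead of deploying a broken release

# deploy: export a clean copy of the code into a new release directory
svn -q export "$REPOSITORY" "$NEW_RELEASE_DIR"
chmod -R g+w "$NEW_RELEASE_DIR"

# symlink: point "current" at the new release and link in the shared directories
rm -f "$CURRENT_DIR"
ln -s "$NEW_RELEASE_DIR" "$CURRENT_DIR"
ln -s "$SHARED_DIR/log" "$CURRENT_DIR/log"
mkdir -p "$CURRENT_DIR/tmp"
ln -s "$SHARED_DIR/pids" "$CURRENT_DIR/tmp/pids"

# migrate: apply pending database migrations
cd "$CURRENT_DIR" && rake db:migrate RAILS_ENV=production

# restart: touch the marker file that signals the app servers to restart
touch "$SHARED_DIR/pids/restart.touch"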
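
The script only touches a marker file; something on each app server
has to notice the new timestamp and act on it. How that is wired up
is not shown here, so the following is only a sketch of one possible
watcher. The helper names, the stamp file, and the restart command
are hypothetical.

```shell
#!/bin/sh
# Hypothetical watcher for the restart marker (names are illustrative).

# restart_requested MARKER STAMP: succeed if MARKER exists and is newer
# than STAMP, or if STAMP does not exist yet.
restart_requested() {
  [ -e "$1" ] || return 1
  [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# handle_restart MARKER STAMP COMMAND...: run COMMAND once per touch of MARKER.
handle_restart() {
  marker=$1 stamp=$2
  shift 2
  if restart_requested "$marker" "$stamp"; then
    # Copy the marker's mtime to the stamp before restarting, so a touch
    # that arrives during the restart triggers another run later.
    touch -r "$marker" "$stamp"
    "$@"
  fi
}
```

Each app server could run handle_restart periodically with its own
local stamp file, so that one touch on the shared filesystem restarts
every server exactly once.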

Deploying

Wrapping Up

  1. Virtualization.
  2. Mirrored filesystem (NFS).
  3. Monit + RRDTool = Overlord.
  4. Good old-fashioned scripting.
    Favoring existing conventions saves time.

Where next?

  1. Better abstractions for Overlord.
  2. Automagic zero-downtime
    configuration updates.
  3. Centralized logging, alerting
    (integrated with RRDTool).
  4. A better NFS...?